Strong mitigation: nesting search for good policies within search for good reward
نویسندگان
چکیده
Recent work has defined an optimal reward problem (ORP) in which an agent designer, with an objective reward function that evaluates an agent’s behavior, has a choice of what reward function to build into a learning or planning agent to guide its behavior. Existing results on ORP show weak mitigation of limited computational resources, i.e., the existence of reward functions so that agents when guided by them do better than when guided by the objective reward function. These existing results ignore the cost of finding such good reward functions. We define a nested optimal reward and control architecture that achieves strong mitigation of limited computational resources. We show empirically that the designer is better off using the new architecture that spends some of its limited resources learning a good reward function instead of using all of its resources to optimize its behavior with respect to the objective reward function.
منابع مشابه
Focused Real-Time Dynamic Programming for MDPs: Squeezing More Out of a Heuristic
Real-time dynamic programming (RTDP) is a heuristic search algorithm for solving MDPs. We present a modified algorithm called Focused RTDP with several improvements. While RTDP maintains only an upper bound on the long-term reward function, FRTDP maintains two-sided bounds and bases the output policy on the lower bound. FRTDP guides search with a new rule for outcome selection, focusing on part...
متن کاملUsing Online Tools to Assess Public Responses to Climate Change Mitigation Policies in Japan
As a member of the Annex 1 countries to the Kyoto Protocol of the United Nations Framework Convention on Climate Change, Japan is committed to reducing 6% of the greenhouse gas emissions. In order to achieve this commitment, Japan has undertaken several major mitigation measures, one of which is the domestic measure that includes ecologically friendly lifestyle programs, utilizing natural energ...
متن کاملOPTIMAL DESIGN OF SINGLE-LAYER BARREL VAULT FRAMES USING IMPROVED MAGNETIC CHARGED SYSTEM SEARCH
The objective of this paper is to present an optimal design for single-layer barrel vault frames via improved magnetic charged system search (IMCSS) and open application programming interface (OAPI). The IMCSS algorithm is utilized as the optimization algorithm and the OAPI is used as an interface tool between analysis software and the programming language. In the proposed algorithm, magnetic c...
متن کاملInverse Reinforcement Learning based on Critical State
Inverse reinforcement learning is tried to search a reward function based on Markov Decision Process. In the IRL topics, experts produce some good traces to make agents learn and adjust the reward function. But the function is difficult to set in some complicate problems. In this paper, Inverse Reinforcement Learning based on Critical State (IRLCS) is proposed to search a succinct and meaningfu...
متن کاملVoltage Sag Compensation with DVR in Power Distribution System Based on Improved Cuckoo Search Tree-Fuzzy Rule Based Classifier Algorithm
A new technique presents to improve the performance of dynamic voltage restorer (DVR) for voltage sag mitigation. This control scheme is based on cuckoo search algorithm with tree fuzzy rule based classifier (CSA-TFRC). CSA is used for optimizing the output of TFRC so the classification output of the network is enhanced. While, the combination of cuckoo search algorithm, fuzzy and decision tree...
متن کامل